Open In Colab

Introduction¶

This article was originally created by Umair for a lab meeting presentation and later expanded by Kai to be more comprehensive. The main goal is to review linear regression.

We think this is important because linear regression is perhaps the most basic problem in machine learning, yet it actually has a closed-form solution. Many deep learning tutorials today jump directly to network structure, training, testing, loss functions and so on, but never really start from the simplest setting: ordinary least-squares regression.

Many problems that we encounter in the real world can actually be solved by linear regression. Yet many deep learning courses or tutorials start from less-than-ideal examples that could have been solved by linear regression, for example predicting house prices from the square footage of the house, or predicting grades in a course from the number of hours spent studying.

So we created this article to help beginners better understand the relative merits of conventional regression methods versus modern machine-learning methods. After reading this article, a reader will understand the closed-form solutions (optionally via LU decomposition and QR decomposition) to linear regression with or without L2 regularization, as well as gradient descent solutions to linear regression without regularization and with L2, L1 and Elastic Net penalties.

Linear Regression¶

Suppose we have a training sample $S=\{(x_1, y_1),\ldots, (x_n, y_n)\}$, where each $x_i \in \mathbb{R}^k$ is a vector of $k$ features and $y_i \in \mathbb{R}$ is a real-valued label or prediction. We would like to learn from $S$ a regression model or function $f:\mathbb{R}^k\to\mathbb{R}$ that can accurately predict labels of new instances in $\mathbb{R}^k$.

A linear regression model $f$ takes the following form: $$f(x)=x^Tw$$ where $x \in \mathbb{R}^k$ is an instance, and $$w=\begin{bmatrix}w_1\\w_2\\ \vdots \\ w_k \end{bmatrix}\in \mathbb{R}^k$$

$w_i$ are the weights or parameters of this linear regression model. We usually denote the predicted label $f(x)$ as $\hat{y}$, so $$\hat{y}:=f(x)=x^Tw$$

There are many real-world examples of such data. $Y$ may be housing prices and $X$ the square footage of the houses. $Y$ may be the grades in a class and $X$ the hours spent studying for the class.

We can represent our data of feature instances as an $n\times k$ matrix $X$, whose rows are transpositions of $x_i$ instances, i.e. $$X=\begin{bmatrix}x_1^T\\x_2^T\\ \vdots \\ x_n^T \end{bmatrix} \in \mathbb{R}^{n \times k}$$

and our labels can be represented as an $n\times 1$ matrix $Y$

$$Y=\begin{bmatrix}y_1\\y_2\\ \vdots \\ y_n \end{bmatrix} \in \mathbb{R}^n$$

Under this notation, labels predicted by the linear regression model $f$ above can be written as

$$\hat{Y}:=f(X)=Xw=\begin{bmatrix}x_1^T\\x_2^T\\ \vdots \\ x_n^T \end{bmatrix} \begin{bmatrix}w_1\\w_2\\ \vdots \\ w_k\end{bmatrix}$$
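As a quick illustration of this matrix form, here is a tiny NumPy sketch with made-up numbers (not part of the datasets used later in this article):

In [ ]:
import numpy as np

# made-up data: n = 4 instances, k = 2 features
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0]])
w = np.array([0.7, -0.2])

Y_hat = X.dot(w)   # each entry is x_i^T w
Y_hat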

Bias term¶

It is possible in some situations that the relationship between the instances $x$ and labels $y$ is not well represented by a linear transformation, but can instead be represented by an affine transformation (a linear transformation followed by a translation). Here a translation means a map of the form $T(\overrightarrow{y})=\overrightarrow{y}+\overrightarrow{c}$ where $\overrightarrow{c} \in \mathbb{R}^k$. The amount of translation added to the linear transformation is called the bias. This gives a more general form of the linear regression model, which can be represented as $$f(x)=x^Tw+b$$ where $x \in \mathbb{R}^k$ is an instance, $$w=\begin{bmatrix}w_1\\w_2\\ \vdots \\ w_k \end{bmatrix}\in \mathbb{R}^k$$ is the weight vector as before, and $b \in \mathbb{R}$ is the bias term. With this generalized form, the labels predicted by the linear regression model $f$ can be written as

$$\hat{Y}:=f(X)=Xw+b=\begin{bmatrix}x_1^T\\x_2^T\\ \vdots \\ x_n^T \end{bmatrix} \begin{bmatrix}w_1\\w_2\\ \vdots \\ w_k\end{bmatrix}+\begin{bmatrix}b\\b\\ \vdots \\ b\end{bmatrix}$$

We are abusing notation a bit here by using the symbol $b$ again for the column vector of $b$'s.

Moreover, we can treat the bias term (or equivalently the affine transformation) as a linear transformation by appending a 1 to each instance $x_i$ as the $(k+1)$-th feature, and appending to the weight vector $w$ a new parameter $w_{k+1}:=b$ representing the bias term $b$, i.e.

$$\tilde{X}=\begin{bmatrix}x_1^T & 1\\x_2^T & 1\\ \vdots & \vdots \\ x_n^T & 1\end{bmatrix}\qquad \tilde{w}=\begin{bmatrix}w_1\\w_2\\ \vdots \\ w_k \\ w_{k+1} \end{bmatrix}$$

Now, our predictions can be written as

$$\hat{Y}=Xw+b=\begin{bmatrix}x_1^T\\x_2^T\\ \vdots \\ x_n^T \end{bmatrix} \begin{bmatrix}w_1\\w_2\\ \vdots \\ w_k\end{bmatrix}+\begin{bmatrix}b\\b\\ \vdots \\ b\end{bmatrix}= \begin{bmatrix}x_1^Tw+b\\x_2^Tw+b\\ \vdots \\ x_n^Tw+b \end{bmatrix}=\begin{bmatrix}x_1^T & 1\\x_2^T & 1\\ \vdots & \vdots \\ x_n^T & 1\end{bmatrix} \begin{bmatrix}w_1\\w_2\\ \vdots \\ w_k \\ w_{k+1} \end{bmatrix}=\tilde{X}\tilde{w}$$

Thus, we can treat the bias term as just another feature, and we are back to the case where $X$ is an $n\times k$ matrix and $w$ is a $k\times 1$ vector (with $k$ now counting the appended constant feature). Therefore, in the remainder of this article we will ignore the bias term and assume it is being taken care of in the manner above. A minimal sketch of this trick follows.
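Here is a small sketch with made-up numbers (the same idea is applied with np.stack in Example 1 below):

In [ ]:
import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5]])   # n = 3 instances, k = 2 features
w = np.array([0.7, -0.2])
b = 1.5                      # bias term

X_tilde = np.column_stack([X, np.ones(len(X))])   # append 1 as the (k+1)-th feature
w_tilde = np.append(w, b)                         # append b as w_{k+1}

print(X.dot(w) + b)          # affine form
print(X_tilde.dot(w_tilde))  # purely linear form -- identical result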

Squared Loss¶

For our function to be a good predictor, we would like $\hat{Y}$ to be as close as possible to $Y$. We will measure closeness by the squared error, i.e. we would like $(\hat{y_i}-y_i)^2$ to be as small as possible. For the whole dataset, we define the loss w.r.t. the parameter vector $w$ as

$$Loss_w =\dfrac{1}{n}\sum_{i=1}^{n}(\hat{y_i}-y_i)^2=\dfrac{1}{n}\sum_{i=1}^{n}(w^Tx_i-y_i)^2$$

Now our objective is to find $$\hat{w}=\arg\min_{w\in \mathbb{R}^k} Loss_w$$
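In NumPy this loss is a one-liner; here is a minimal sketch, assuming NumPy is imported as np (we do this in the Coding preparation section below). The same expression reappears in the worked examples.

In [ ]:
def squared_loss(w, X, Y):
    """Mean squared error of the linear model Xw against the labels Y."""
    residual = X.dot(w) - Y              # \hat{y}_i - y_i for every instance
    return np.mean(np.square(residual))  # (1/n) * sum of squared residuals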

Optimal Parameter¶

Setting gradient of $Loss_w$ w.r.t $w$ equal to 0, we obtain $$\nabla_w Loss_w=\dfrac{2}{n}\sum_{i=1}^{n}(w^Tx_i-y_i)x_i=0$$ Rewriting this equation in terms of matrices, we get $$ X^TXw-X^Ty=0$$

$$X^TXw=X^Ty \qquad (\star)$$

If $X^TX$ is an invertible $k\times k$ matrix, we obtain a unique solution $$w=(X^TX)^{-1}X^Ty$$

If $X^TX$ is not invertible, we can instead use its pseudo-inverse $(X^TX)^+$ to obtain a solution $$w=(X^TX)^{+}X^Ty$$

This solution is not unique; in fact there are infinitely many solutions (the pseudo-inverse gives the minimum-norm one).
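As a small sketch of how these formulas translate into NumPy (with made-up data; the worked examples below go through this in detail), note that np.linalg.lstsq also solves the least-squares problem directly:

In [ ]:
import numpy as np

# made-up data: n = 5 instances, k = 2 features (second column is a constant feature)
X = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [3.0, 1.0],
              [4.0, 1.0],
              [5.0, 1.0]])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# textbook normal-equation solution (requires X^T X to be invertible)
w_inv = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)

# pseudo-inverse version, usable even when X^T X is singular
w_pinv = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(y)

# np.linalg.lstsq solves the least-squares problem directly
# (returning the minimum-norm solution when infinitely many exist)
w_lstsq, residuals, rank, svals = np.linalg.lstsq(X, y, rcond=None)

print(w_inv, w_pinv, w_lstsq)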

How do we judge whether a matrix is invertible? See the example below (adapted from an example found online):

In [ ]:
import numpy as np  
mat = np.array([[1,0,0],[-1,3,3],[1,2,2]]) 
rank = np.linalg.matrix_rank(mat)
cond = np.linalg.cond(mat)
print ("rank=", rank, " cond=", cond)
rank= 2  cond= 5.345097829566925e+17

Because the 2nd and 3rd columns of the matrix are identical, the rank is 2. The condition number is also extremely high. Therefore, the matrix is not invertible. You may want to run np.linalg.inv(mat) and see what you get; it could be an error about the matrix being singular, or it could produce numerically meaningless results, depending on your NumPy version within Colab.

In these cases, we can compute the Moore-Penrose pseudo-inverse of a matrix instead.

In [ ]:
import numpy as np  
mat = np.array([[1,0,0],[-1,3,3],[1,2,2]]) 
inv = np.linalg.pinv(mat)
print (inv)
np.dot(inv, mat) 
[[ 0.34210526 -0.26315789  0.39473684]
 [ 0.01315789  0.10526316  0.09210526]
 [ 0.01315789  0.10526316  0.09210526]]
Out[ ]:
array([[1.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [4.16333634e-17, 5.00000000e-01, 5.00000000e-01],
       [4.16333634e-17, 5.00000000e-01, 5.00000000e-01]])

Basically, this function calculates the generalized inverse of a matrix using its singular-value decomposition (SVD), keeping only the large singular values. The result of multiplying the pseudo-inverse with the original matrix is not an identity matrix: the entries in the second and third rows/columns are 0.5 instead. In any case, let's move on, keeping in mind that collinearity can be a challenge in linear regression; we will address it in the sections below through regularization.
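To make the SVD statement concrete, here is a rough sketch of how a pseudo-inverse can be assembled by hand from np.linalg.svd; np.linalg.pinv does essentially this, with its own tolerance handling.

In [ ]:
import numpy as np

mat = np.array([[1, 0, 0], [-1, 3, 3], [1, 2, 2]], dtype=float)

# SVD: mat = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(mat)

# keep only the "large" singular values; zero out the near-zero ones
rcond = 1e-15                                        # cutoff relative to the largest singular value
s_inv = np.where(s > rcond * s.max(), 1.0 / s, 0.0)

pinv_manual = Vt.T.dot(np.diag(s_inv)).dot(U.T)
print(np.allclose(pinv_manual, np.linalg.pinv(mat)))  # expected to print True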

Coding preparation¶

Here we first import NumPy and Matplotlib, because these modules will be used in the examples below. Later (in Example 2) we will also define a solve_by_gradient_descent function. Note that we give that function a docstring by placing a string literal on its first line; because docstrings are often multiple lines, we use Python's triple-quote notation for multi-line strings.

In [ ]:
import numpy as np
import matplotlib.pyplot as plt

Example 1¶

We create a set of 25 $X$ values using np.arange plus some random noise, then calculate $Y$ as a function of $X$ (however, to make the problem more difficult, $Y$ is created not as a simple linear function but as a combination of several functions, with added noise and outliers). We then plot $Y$ versus $X$. There is a general trend that $Y$ increases with $X$, but the relationship is not monotonic because of the sin() term below. We will call this dataset 'Dataset 1'.

In [ ]:
rng=np.random.RandomState(3)
np.random.seed(543)

x=np.arange(5,30)/3+0.2*np.random.randn(25)+5
y=2*x-0.5*np.square(x-10)+3*np.sin(x+0.5)+1+0.5*np.random.randn(25)
y[::5]+=50*(0.5-rng.rand(5))
plt.scatter(x, y, s=30, c='red', marker='+')
plt.show()
In [ ]:
x, y
Out[ ]:
(array([ 6.56347837,  6.90468638,  7.36155461,  7.61217254,  7.83663237,
         8.50572885,  8.82459789,  8.99898214,  9.23107005,  9.46395953,
         9.9823754 , 10.42786634, 10.16821181, 11.04209398, 11.53256652,
        11.40559324, 12.32589701, 12.54873551, 13.09798103, 13.46266847,
        13.56169645, 13.90542475, 14.31070954, 14.12825122, 14.69545437]),
 array([ 9.17073128, 12.36943984, 14.41253873, 15.32495012, 16.90494901,
         8.19381999, 17.87332467, 18.00769794, 19.19394809, 18.49061105,
        29.17813602, 17.98533015, 18.50208979, 20.35720439, 21.2260356 ,
        20.27359827, 23.94399823, 23.66264186, 24.12110209, 25.83297486,
         5.94216416, 23.96504774, 21.94770807, 23.52732822, 20.20772171]))

We next stack a column of 1's onto $x$ to include the bias term in our model, creating a new $X$ matrix with two columns.

In [ ]:
new_x=np.stack([x,np.ones(len(x))],axis=1)
new_x
Out[ ]:
array([[ 6.56347837,  1.        ],
       [ 6.90468638,  1.        ],
       [ 7.36155461,  1.        ],
       [ 7.61217254,  1.        ],
       [ 7.83663237,  1.        ],
       [ 8.50572885,  1.        ],
       [ 8.82459789,  1.        ],
       [ 8.99898214,  1.        ],
       [ 9.23107005,  1.        ],
       [ 9.46395953,  1.        ],
       [ 9.9823754 ,  1.        ],
       [10.42786634,  1.        ],
       [10.16821181,  1.        ],
       [11.04209398,  1.        ],
       [11.53256652,  1.        ],
       [11.40559324,  1.        ],
       [12.32589701,  1.        ],
       [12.54873551,  1.        ],
       [13.09798103,  1.        ],
       [13.46266847,  1.        ],
       [13.56169645,  1.        ],
       [13.90542475,  1.        ],
       [14.31070954,  1.        ],
       [14.12825122,  1.        ],
       [14.69545437,  1.        ]])

We next calculate $X_{new}^TX_{new}$, which should be a $2\times 2$ matrix.

In [ ]:
np.transpose(new_x).dot(new_x)
Out[ ]:
array([[3026.92222155,  267.89838837],
       [ 267.89838837,   25.        ]])

And then calculate its inverse.

In [ ]:
inv=np.linalg.inv(np.transpose(new_x).dot(new_x)).astype(np.float64)
inv
Out[ ]:
array([[ 0.00640449, -0.06863014],
       [-0.06863014,  0.7754362 ]])

Next, follow the closed form equation above to calculate the $w$ values, with dot product.

In [ ]:
w=inv.dot(np.transpose(new_x).dot(y))
w
Out[ ]:
array([1.15421245, 6.45613743])

Here is the mean squared error for the training set.

In [ ]:
y_est=new_x.dot(w)    
error=np.mean(np.square(y-y_est))
print('\nMean Squared Error for training set:%.8f' %error)
Mean Squared Error for training set:21.93024489

Next, plot the prediction line (blue line below) and compare with the observed values (red cross below).

In [ ]:
plt.scatter(x, y, s=30, c='red', marker='+')
plt.plot(x, new_x.dot(w), c='blue')
plt.show()

In summary, in the example above, we used the closed-form equation to find the ordinary least-squares solution for $w$ (the first entry of $w$ being the slope and the second the intercept) that relates the observed $Y$ values to $X$.

Alternative approach 1 to tackle the problem: LU decomposition¶

Let us look at the solution above again carefully. We derived closed form solution through $$w=(X^TX)^{-1}X^Ty$$

This is a solution written in every textbook. However, is it practical? Note that matrix inversion is computationally expensive. If we instead simply look at the equation:$$X^TXw=X^Ty \qquad$$

It is essentially a system of linear equations (an $ax = b$ problem) in which we want the value of $w$ (playing the role of $x$). We can solve it directly with NumPy's solve function, which internally uses LU decomposition. This can be up to an order of magnitude faster than the closed-form solution above, especially when handling large matrices.

See below for the actual code. As you will see, the $w$ values are identical to those from the closed-form solution above.

In [ ]:
a = np.transpose(new_x).dot(new_x)
b = np.transpose(new_x).dot(y)
w = np.linalg.solve(a,b)
w
Out[ ]:
array([1.15421245, 6.45613743])
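If you want to see the LU factorization explicitly rather than leaving it hidden inside np.linalg.solve, SciPy exposes it through scipy.linalg.lu_factor and lu_solve. Here is a minimal sketch, reusing the same new_x and y arrays from above; it should give the same $w$.

In [ ]:
from scipy.linalg import lu_factor, lu_solve

a = np.transpose(new_x).dot(new_x)
b = np.transpose(new_x).dot(y)

lu, piv = lu_factor(a)          # factor a = P L U once
w_lu = lu_solve((lu, piv), b)   # then solve via forward/backward substitution
w_lu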

Now, we will use the %%timeit magic command to measure the time for each method (use %lsmagic to see a list of all magic commands within Colab). Since this is a small matrix, the difference in speed is modest, but you can get an idea.

In [ ]:
%%timeit
w1 = np.linalg.inv(np.transpose(new_x).dot(new_x)).dot(np.transpose(new_x).dot(y))
The slowest run took 19.69 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 5: 24.3 µs per loop
In [ ]:
%%timeit
w2 =  np.linalg.solve(np.transpose(new_x).dot(new_x), np.transpose(new_x).dot(y))
The slowest run took 9.87 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 5: 14.6 µs per loop

Alternative approach 2 to tackle the problem: QR decomposition¶

Let us consider the generic problem of solving $$X^TXw=X^Ty \qquad$$

Suppose we perform a QR decomposition of $X$, that is, write $X = QR$, where $Q$ has orthonormal columns ($Q^TQ=I$) and $R$ is an upper-triangular matrix. Then we have $$X^T = (QR)^T = R^T Q^T$$ $$X^T X = R^T Q^T Q R = R^T R$$

So the problem is now reduced to $$R^T R w = R^T Q^T y$$ and, since $R^T$ is invertible when $X$ has full column rank, $$R w = Q^T y$$

and we only need to solve this triangular system (e.g. by back-substitution) to get the value of $w$.

In [ ]:
(q, r) = np.linalg.qr(new_x)
a = r
b = np.transpose(q).dot(y)
w = np.linalg.solve(a,b)
w
Out[ ]:
array([1.15421245, 6.45613743])

Now, why would we do this? The answer has to do with the condition number, which measures how much the output of a function can change for a small change in its input. With this approach we avoid explicitly forming $X^T X$, whose condition number is the square of that of $X$ and can therefore be very high; this improves numerical precision when handling large matrices.

See the code below. The $R$ matrix has a much lower condition number than the $X^T X$ matrix, so this approach gives greater numerical precision in real-world applications with very large or ill-conditioned matrices.

In [ ]:
cond1 = np.linalg.cond(np.transpose(new_x).dot(new_x))
cond2 = np.linalg.cond(r)
cond1, cond2
Out[ ]:
(2384.116567408285, 48.82741614511562)

Alternative approach 3 to tackle the problem: gradient descent¶

Now consider a more complex scenario: a data matrix with 10 million rows. Say, for example, we generated gene expression data on 20,000 genes from 10 million human cells, and we want to run some sort of regression analysis to determine which genes are most relevant to a specific property of the cells (assuming that property is known for each cell from some type of measurement).

Traditional linear algebra, even QR decomposition, becomes impractical in this case; a closed-form solution is not really computationally tractable on conventional computers. To solve this problem, we want to avoid expensive matrix factorizations and use gradient descent instead to obtain approximately correct solutions. Iterative methods such as gradient descent have advantages when we have a large amount of data or the data is very sparse, which is exactly the situation with 10 million human cells. In practice, we would probably use stochastic gradient descent (SGD), where each update uses a mini-batch of the training data rather than all the training samples. A good explanation is given here.
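To make the mini-batch idea concrete, below is a rough sketch of mini-batch SGD for the squared loss. This is only an illustration (the solve_by_sgd name and the batch_size parameter are ours and are not used elsewhere in this article); the examples that follow still use full-batch gradient descent.

In [ ]:
def solve_by_sgd(X, y, rate=0.001, batch_size=32, num_epochs=1000, seed=0):
    """Illustrative mini-batch SGD for ordinary least squares."""
    rng = np.random.RandomState(seed)
    w = np.ones(X.shape[1])
    n = len(X)
    for epoch in range(num_epochs):
        order = rng.permutation(n)                        # shuffle the rows once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]         # indices of one mini-batch
            Xb, yb = X[idx], y[idx]
            grad = 2 * (Xb.dot(w) - yb).dot(Xb) / len(Xb) # gradient estimated on the mini-batch
            w = w - rate * grad
    return w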

Solution using gradient descent¶

Following the discussion above, rather than using closed-form solutions from traditional linear algebra, we use gradient descent to minimize the loss function, running 100,000 steps.

Recall our loss function $$Loss_w =\dfrac{1}{n}\sum_{i=1}^{n}(\hat{y_i}-y_i)^2=\dfrac{1}{n}\sum_{i=1}^{n}(w^Tx_i-y_i)^2$$

And we want to find $$\hat{w}=\arg\min_{w\in \mathbb{R}^k} Loss_w$$

$Loss_w$ is a real-valued function of $w$ (here a two-dimensional vector containing the slope and the intercept). If we calculate the gradient of $Loss_w$, we obtain

$$\nabla_w Loss_w=\dfrac{2}{n}\sum_{i=1}^{n}(w^Tx_i-y_i)x_i$$

From calculus we know that if $\nabla_w Loss_w\neq 0$, the function $Loss_w$ is increasing fastest in the direction of $\nabla_w Loss_w$ and decreasing fastest in the direction of $-\nabla_w Loss_w$.

We will move against the gradient to decrease the loss by taking a small step in the direction $$-\eta\nabla_w Loss_w$$ where $\eta$ is the step size (learning rate). Since the loss function is convex, we are guaranteed that $w$ will converge to the unique minimum if the step size is sufficiently small. We will update the parameter vector as $$w_{new}=w-\eta\nabla_w Loss_w$$

In [ ]:
w=np.array([1,1])   # initial guess for [slope, intercept]
loss_list=[]        # loss at each iteration
w_list=[]           # weight vector at each iteration
w_grad_list=[]      # gradient at each iteration
rate=0.001          # learning rate (step size)
for i in range(100000):
    w_list.append(w)
    y_est=new_x.dot(w)                          # current predictions Xw
    loss=np.mean(np.square(y-y_est))            # mean squared error
    loss_list.append(loss)
    w_grad=2*(y_est-y).dot(new_x)/len(new_x)    # gradient of the loss w.r.t. w
    w=w-rate*w_grad                             # gradient descent update
    w_grad_list.append(w_grad)
    

We can check the first 5 and last 5 values of the loss function $Loss_w$.

In [ ]:
loss_list[:5]+loss_list[-5:]
Out[ ]:
[72.61193759936884,
 51.53705277576042,
 39.49350369915836,
 32.61098612822289,
 28.677782706722965,
 21.93024489577661,
 21.93024489577622,
 21.93024489577582,
 21.930244895775427,
 21.930244895775033]

A plot of the training loss $Loss_w$ during the first 50 iterations.

In [ ]:
plt.scatter(range(1,len(loss_list[:50])+1),loss_list[:50], s=30, c='red', marker='+')
plt.show()

Here are the first 5 and last 5 gradient vectors $\nabla_w Loss_w$. The extremely small gradient values toward the end show that the solution has converged and there is not much to be gained by training further.

In [ ]:
w_grad_list[:5], w_grad_list[-5:]
Out[ ]:
([array([-154.27836215,  -14.21733628]),
  array([-116.61456982,  -10.88242764]),
  array([-88.14268041,  -8.36139436]),
  array([-66.61939705,  -6.45560901]),
  array([-50.3489026 ,  -5.01491947])],
 [array([ 1.75399006e-06, -1.98095519e-05]),
  array([ 1.75381050e-06, -1.98075241e-05]),
  array([ 1.75363099e-06, -1.98054965e-05]),
  array([ 1.75345147e-06, -1.98034691e-05]),
  array([ 1.75327198e-06, -1.98014419e-05])])

Here are the first 5 and last 5 weight vectors $w$.

In [ ]:
w_list[:5], w_list[-5:]
Out[ ]:
([array([1, 1]),
  array([1.15427836, 1.01421734]),
  array([1.27089293, 1.02509976]),
  array([1.35903561, 1.03346116]),
  array([1.42565501, 1.03991677])],
 [array([1.15422959, 6.45594391]),
  array([1.15422959, 6.45594393]),
  array([1.15422959, 6.45594395]),
  array([1.15422958, 6.45594397]),
  array([1.15422958, 6.45594399])])

This is the final value of the weight vector $w$. Check it against the closed-form solution above.

In [ ]:
w
Out[ ]:
array([1.15422958, 6.45594401])

We now plot the learned model (blue line) against the training dataset (red '+' symbols).

In [ ]:
plt.scatter(x, y, s=30, c='red', marker='+')
plt.plot(x, new_x.dot(w), c='blue')
plt.show()

Example 2¶

Here we will consider 4-dimensional data, i.e. our examples have four features $x,y,s,t$, and we will ignore the bias term for now. We will call this 'Dataset 2' and will come back to it several times. Note that the feature $s$ is essentially feature $x$ with some Gaussian noise added. The label $z$ is a linear combination of all the features plus some extra noise, but the coefficient of $t$ is almost zero (1e-6).

In [ ]:
rng=np.random.RandomState(3)
np.random.seed(543)

x=np.arange(5,25,0.1)
y=5+10*np.random.randn(200)
s=x+0.1*np.random.randn(200)
t=-5-10*np.random.randn(200)

z=2*x+y+3*s+1e-6*t+2*np.random.randn(200)
z[::20]+=10*(0.5-rng.rand(len(z[::20])))

X=np.stack([x,y,s,t],axis=1)

from sklearn.model_selection import train_test_split

train_input, test_input, train_label, test_label = train_test_split(X,z, test_size=0.5)

First we will solve this using the closed form, without any L2 regularization.

In [ ]:
inv=np.linalg.inv(np.transpose(train_input).dot(train_input))

w=inv.dot(np.transpose(train_input).dot(train_label))

print('Weight vector is')
print(w)

pred=train_input.dot(w)    
error=np.mean(np.square(train_label-pred))
print('\nMean Squared Error for training set:%.8f' %error)

pred=test_input.dot(w)    
error=np.mean(np.square(test_label-pred))
print('\nMean Squared Error for testing set:%.8f' %error)
Weight vector is
[ 6.34999142  1.0003759  -1.35219666  0.01582614]

Mean Squared Error for training set:4.06102838

Mean Squared Error for testing set:4.61530081

Now we will use gradient descent to solve this.

In [ ]:
def solve_by_gradient_descent(X,y,rate=0.001,l2=0,l1=0,early_drop=True,num_epochs=10000):
  """Solve an Xw -> y regression by gradient descent and return the weight vector w.

  rate is the learning rate; l2 and l1 are the regularization penalties.
  (The early_drop argument is accepted but not used in this simple implementation.)
  """
  print('Solving by gradient descent using \nL2-regularization penalty=%.4f \nL1-regularization penalty=%.4f\n' %(l2,l1))
  w=np.ones(X.shape[1])   # initialize all weights to 1

  for i in range(num_epochs):
      y_est=X.dot(w)                                              # current predictions
      # gradient of the squared loss plus the L2 and L1 penalty terms
      w_grad=2*(y_est-y).dot(X)/len(X) +2*l2*w + l1*np.sign(w)
      w=w-rate*w_grad                                             # gradient descent update

  # regularized loss at the end of training
  loss=np.mean(np.square(y-y_est)) + l2*np.sum(np.square(w)) + l1*np.sum(np.abs(w))
  print('Last loss calculated')
  print(loss)

  print('\nLast gradient vector')
  print(w_grad)

  print('\nFinal value of weight vector w:')
  print(w)

  y_est=X.dot(w)
  error=np.mean(np.square(y-y_est))
  print('\nMean Squared Error for training set:%.8f' %error)

  return w

Now we use the solve_by_gradient_descent function to solve it. Notice that the last gradient is very small, so we can be confident that we are close to convergence.

In [ ]:
w=solve_by_gradient_descent(train_input,train_label,early_drop=False,rate=0.0008,num_epochs=800000)
pred=test_input.dot(w)    
error=np.mean(np.square(test_label-pred))
print('\nMean Squared Error for testing set:%.8f' %error)
Solving by gradient descent using 
L2-regularization penalty=0.0000 
L1-regularization penalty=0.0000

Last loss calculated
4.061028604908347

Last gradient vector
[-4.85360505e-05  7.80602743e-08  4.85037157e-05  5.11682177e-08]

Final value of weight vector w:
[ 6.34537362  1.00038333 -1.34758194  0.01583101]

Mean Squared Error for training set:4.06102860

Mean Squared Error for testing set:4.61494387

L2-regularization¶

In this section, we introduce L2 regularization, also called Ridge regression. Essentially, we add a penalty term proportional to $\sum_j w_j^{2}$ to the loss function, which pushes the model to shrink the magnitudes of the weights $w_j$. Suppose our training set has $n$ examples. This problem also has a closed-form solution, as shown below.

For a $\lambda>0$, consider the following modified loss $$Loss_w =\dfrac{1}{n}\sum_{i=1}^{n}(\hat{y_i}-y_i)^2 +\lambda\sum_{j=1}^k w_j^{2}=\dfrac{1}{n}\sum_{i=1}^{n}(w^Tx_i-y_i)^2 +\lambda\sum_{j=1}^k w_j^{2}$$

And we want to find $$\hat{w}=\arg\min_{w\in \mathbb{R}^k} Loss_w$$

$Loss_w$ is a real valued function dependent on $w$.

Setting gradient of $Loss_w$ w.r.t $w$ equal to 0, we obtain $$\nabla_w Loss_w=\dfrac{2}{n}\sum_{i=1}^{n}(w^Tx_i-y_i)x_i+2\lambda w=0$$ Rewriting this equation in terms of matrices, we get

$$(X^TX+n\lambda I_k)w=X^Ty \qquad (\star)$$

Since $(X^TX+n\lambda I_k)$ is always an invertible $k\times k$ matrix for $\lambda>0$, we obtain a unique solution $$w=(X^TX+n\lambda I_k)^{-1}X^Ty$$

Closed form solution for Dataset 2¶

In [ ]:
l=0.5
inv=np.linalg.inv(np.transpose(train_input).dot(train_input)+len(train_input)*l*np.eye(train_input.shape[1]))

w=inv.dot(np.transpose(train_input).dot(train_label))
print(w)

pred=train_input.dot(w)    
error=np.mean(np.square(train_label-pred))
print('\nMean Squared Error for training set:%.8f' %error)

pred=test_input.dot(w)    
error=np.mean(np.square(test_label-pred))

print('\nMean Squared Error for testing set:%.8f' %error)
[2.53511683 1.004805   2.45503268 0.01672721]

Mean Squared Error for training set:4.22093582

Mean Squared Error for testing set:4.46394639

Gradient descent with L2-regularization for Dataset 2¶

This works just like the non-regularized case, except that the gradient now is $$\nabla_w Loss_w=\dfrac{2}{n}\sum_{i=1}^{n}(w^Tx_i-y_i)x_i+2\lambda w$$

In [ ]:
w=solve_by_gradient_descent(train_input,train_label,early_drop=False,rate=0.0008,num_epochs=400000,l2=0.5)
pred=test_input.dot(w)    
error=np.mean(np.square(test_label-pred))
print('\nMean Squared Error for testing set:%.8f' %error)
Solving by gradient descent using 
L2-regularization penalty=0.5000 
L1-regularization penalty=0.0000

Last loss calculated
10.952893661577539

Last gradient vector
[-2.43360887e-13  8.28226376e-14  2.64233080e-13 -1.52308721e-15]

Final value of weight vector w:
[2.53511683 1.004805   2.45503268 0.01672721]

Mean Squared Error for training set:4.22093582

Mean Squared Error for testing set:4.46394639

Why should we use L2 regularization?¶

We see above that L2 regularization gives us lower testing error than unregularized regression, despite having higher training error. As you can observe, the weights of features $x$ and $s$ add up to roughly the same amount, ~5, in both the L2-regularized and unregularized regressions, because $s$ is essentially a copy of $x$ and together they contribute $2x+3s\approx 5x$ to $z$. However, with Ridge regression these two weights are smaller and similar to each other due to the L2 penalty. Ridge regression performs better because it does not overfit the training dataset as much, since its weights are smaller.

Ridge regression is useful for preventing overfitting, but it is not very good at feature selection. Note that $t$ has virtually no contribution (1e-6) to $z$, yet both Ridge and plain linear regression assign a much larger weight (~1.6e-2) to the feature $t$. Ridge regression scales weights down but never makes them exactly zero (or almost zero). Moreover, it doesn't remove redundant features. We know that features $x$ and $s$ are essentially the same, so one of them is redundant. Ridge regression will not get rid of this redundancy; in fact it prefers to keep it in this case. This is because if we distribute the weight equally between $x$ and $s$, i.e. give them 2.5 each, we get an L2 penalty of $\lambda(2.5^2+2.5^2)=12.5\lambda$, which is lower than giving only one feature the entire weight of 5, in which case the L2 penalty is $25\lambda$.

This is where L1-regularization comes in.

L1-regularization (LASSO Regression)¶

For a $\delta>0$, consider the following modified loss $$Loss_w =\dfrac{1}{n}\sum_{i=1}^{n}(\hat{y_i}-y_i)^2+\delta \sum_{j=1}^k |w_j|=\dfrac{1}{n}\sum_{i=1}^{n}(w^Tx_i-y_i)^2 +\delta \sum_{j=1}^k |w_j|$$

And we want to find $$\hat{w}=\arg\min_{w\in \mathbb{R}^k} Loss_w$$

$Loss_w$ is a real valued function dependent on $w$.

This problem does not have a closed-form solution, so we will use gradient descent. Strictly speaking, the absolute value is not differentiable at zero, so below we use the subgradient $\operatorname{sign}(w)$ (with $\operatorname{sign}(0)=0$):

$$\nabla_w Loss_w=\dfrac{2}{n}\sum_{i=1}^{n}(w^Tx_i-y_i)x_i+\delta \cdot \operatorname{sign}(w)$$

Gradient Descent with L1-regularization for Dataset 2¶

In [ ]:
w=solve_by_gradient_descent(train_input,train_label,early_drop=False,rate=0.0008,num_epochs=400000,l1=2)
pred=test_input.dot(w)    
error=np.mean(np.square(test_label-pred))
print('\nMean Squared Error for testing set:%.8f' %error)
Solving by gradient descent using 
L2-regularization penalty=0.0000 
L1-regularization penalty=2.0000

Last loss calculated
16.090378483244212

Last gradient vector
[-1.62531677e-02  2.60706725e-05  1.62424026e-02  1.72649139e-05]

Final value of weight vector w:
[4.99103690e+00 9.96157412e-01 5.54013947e-04 5.57531970e-03]

Mean Squared Error for training set:4.10373083

Mean Squared Error for testing set:4.47741986

Why use L1 regularization?¶

Unlike L2 regularization, L1 regularization shrinks weights to zero and produces a sparse weight vector. In particular, if there are redundant or highly correlated features, it will lean toward including only one of them in the model and push the weights of the other correlated features close to zero, provided the regularization coefficient is large enough. We can see in the example above that $s$ has an almost zero weight, and the entire combined weight for $x$ and $s$ is attributed to $x$. Moreover, the weight for $t$ is 5.5e-3, which is much smaller than in the Ridge or plain linear regression case.

However, we may not always want to remove redundant or correlated features. In fact, it is sometimes good to know which features are correlated in a model. Looking only at the LASSO result, we might conclude that $s$ is not an important variable or has no predictive ability, which is not the case. In Ridge regression, by contrast, we get to keep both $x$ and $s$. Also, when the number of features $p$ is much larger than the number of samples $n$ ($p \gg n$), LASSO will select at most $n$ features.

We can use elastic nets, presented next, to strike a good balance.

Elastic Net¶

Elastic net is a regularized regression method that linearly combines the L1 and L2 penalties of the lasso and ridge regressions.

A simple way to implement an elastic net is to add both L1 and L2 penalties, each with its own regularization coefficient. In this case, the loss becomes: $$Loss_w =\dfrac{1}{n}\sum_{i=1}^{n}(\hat{y_i}-y_i)^2+\lambda \sum_{j=1}^k w_j^2 +\delta \sum_{j=1}^k |w_j| =\dfrac{1}{n}\sum_{i=1}^{n}(w^Tx_i-y_i)^2 +\lambda \sum_{j=1}^k w_j^2 +\delta \sum_{j=1}^k |w_j|$$

where $\lambda \geq 0$ is the L2 regularizer and $\delta \geq 0$ is the L1 regularizer. We recover LASSO and Ridge regression as special cases of the elastic net by setting $\lambda=0$ or $\delta=0$, respectively.

This loss function is strongly convex due to the presence of the quadratic L2 penalty (for $\lambda>0$). We will solve it using gradient descent.

Gradient Descent with Elastic Net for Dataset 2¶

In [ ]:
w=solve_by_gradient_descent(train_input,train_label,early_drop=False,rate=0.0001,num_epochs=800000,l1=2,l2=0.5)
pred=test_input.dot(w)
error=np.mean(np.square(test_label-pred))
print('\nMean Squared Error for testing set:%.8f' %error)
Solving by gradient descent using 
L2-regularization penalty=0.5000 
L1-regularization penalty=2.0000

Last loss calculated
22.952967690877948

Last gradient vector
[-2.20623519e-12  5.04929432e-13  2.05524486e-12 -1.33226763e-15]

Final value of weight vector w:
[2.5344403  0.99842602 2.45042448 0.00510151]

Mean Squared Error for training set:4.26375892

Mean Squared Error for testing set:4.42632330

In this example, the Elastic Net outperforms plain linear, Ridge and LASSO regression on the test set. It keeps both the $x$ and $s$ features, and at the same time reduces the weight for $t$ to about 5e-3, just as LASSO does. Also, note how the last gradient vector is much smaller here than with LASSO alone: adding the L2 penalty stabilizes the L1 regularization and makes the optimization converge faster due to the strong convexity of the combined penalty. Strong convexity also encourages grouping of highly correlated features, as discussed in the Ridge regression section.
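As a rough cross-check, scikit-learn's Ridge, Lasso and ElasticNet estimators implement the same ideas; a minimal sketch follows. Note that scikit-learn's penalty conventions (its alpha and l1_ratio parameters, and how it scales the squared-error term) differ from the $\lambda$ and $\delta$ used in this article, so the coefficients will not match ours exactly; the alpha values below are arbitrary illustrative choices.

In [ ]:
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# fit_intercept=False because Dataset 2 was generated without a bias term
models = [('Ridge', Ridge(alpha=1.0, fit_intercept=False)),
          ('Lasso', Lasso(alpha=0.1, fit_intercept=False)),
          ('ElasticNet', ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False))]

for name, model in models:
    model.fit(train_input, train_label)
    test_mse = np.mean(np.square(test_label - model.predict(test_input)))
    print(name, model.coef_, 'test MSE=%.4f' % test_mse)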

See this link for more discussion on this topic.